- Probability rules
- Discrete random variables
- Continuous random variables
- Statistical inference
9/1/2020
Probability and statistics let us talk efficiently about things we are unsure about.
All of these involve inferring or predicting unknown quantities!
Random variables are numbers that we are not sure about, but whose potential outcomes we may have some idea of how to describe. They represent uncertain events.
If these numbers can take values in a finite or countably infinite set, the random variable is said to be discrete, otherwise it is called continuous.
Example: Suppose we are about to toss two coins. Let \(X\) denote the number of heads. We say that \(X\) is the random variable that stands for the number we are not sure about.
Probability is a language designed to help us talk and think about aggregate properties of random variables.
The key idea is that to each event we will assign a number between 0 and 1 which reflects how likely that event is to occur.
Basic axioms:

1. For any event \(A\), \(0 \leq P(A) \leq 1\).
2. The probability that *something* happens is \(1\).
3. If \(A\) and \(B\) cannot happen together, then \(P(A \text{ or } B) = P(A) + P(B)\).
4. Complement: \(P(\text{not } A) = 1 - P(A)\).
5. Multiplication rule: \(P(A \text{ and } B) = P(A) \cdot P(B \mid A)\).
Probability distributions describe the behavior of random variables.
Example: \(X\) is the random variable denoting the number of heads in two independent coin tosses.
\(X\) is discrete as we are able to list all the possible outcomes, i.e. \(X \in \{0, 1, 2\}\).
We can describe its behavior through the following probability distribution: \[p(x) = P(X = x) = \begin{cases} 0.25 & \text{if } x = 0 \\ 0.5 & \text{if } x = 1 \\ 0.25 & \text{if } x = 2 \end{cases}\]
Question: What is \(P(X = 0 \text{ and } X = 2)\)? How about \(P(X \geq 1)\)?
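A quick empirical check, as a minimal Python sketch (numpy assumed, fair coin assumed):

```python
# Simulate many rounds of two fair coin tosses and estimate the pmf of X.
import numpy as np

rng = np.random.default_rng(0)
tosses = rng.integers(0, 2, size=(100_000, 2))  # 0 = tails, 1 = heads
x = tosses.sum(axis=1)                          # X = number of heads

for k in (0, 1, 2):
    print(f"P(X = {k}) ~ {np.mean(x == k):.3f}")  # ~0.25, ~0.50, ~0.25
print(f"P(X >= 1) ~ {np.mean(x >= 1):.3f}")       # ~0.75
# P(X = 0 and X = 2) = 0: the two events are mutually exclusive.
```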
In general we want to use probability to address problems involving more than one variable at a time.
Example: returns of a financial portfolio. If we know that the economy will be growing next year, does that change the assessment about the behavior of my returns?
We need to be able to describe what we think will happen to one variable relative to another!
We need the concepts of conditional, joint, and marginal distributions.
Example: How are the returns \(S\) of a portfolio impacted by the overall economy?
Let \(E\) denote the performance of the economy next quarter. For simplicity, say \(E = 1\) if the economy is expanding and \(E = 0\) if the economy is contracting. Let’s assume \(P(E = 1) = 0.7\).
A conditional probability is the chance that one thing happens, given that some other thing has already happened.
| \(S\) | \(P(S \mid E = 1)\) | \(P(S \mid E = 0)\) |
|---|---|---|
| 1 | 0.05 | 0.20 |
| 2 | 0.20 | 0.30 |
| 3 | 0.50 | 0.30 |
| 4 | 0.25 | 0.20 |
The probability of \(S = 4\) given that the economy is growing is \(0.25\).
The conditional distributions tell us what can happen to \(S\) for a given value of \(E\). But what about \(S\) and \(E\) jointly?
\[\begin{split} P(S = 4 \text{ and } E = 1) &= P(E = 1) \cdot P(S =4 \mid E = 1) \\ &= 0.70 \cdot 0.25 = 0.175 \end{split}\]
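The same computation for every cell, as a small Python sketch (values taken from the tables above):

```python
# Build the full joint distribution P(S, E) = P(E) * P(S | E).
p_e = {1: 0.7, 0: 0.3}                         # marginal of the economy
p_s_given_e = {
    1: {1: 0.05, 2: 0.20, 3: 0.50, 4: 0.25},   # P(S = s | E = 1)
    0: {1: 0.20, 2: 0.30, 3: 0.30, 4: 0.20},   # P(S = s | E = 0)
}

joint = {(s, e): p_e[e] * p_s_given_e[e][s]
         for e in p_e for s in p_s_given_e[e]}

print(joint[(4, 1)])                    # 0.7 * 0.25 = 0.175
print(round(sum(joint.values()), 10))   # sanity check: all cells sum to 1.0
```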
We just saw how to calculate the joint distribution starting from a marginal and a conditional distribution.
If we have a joint distribution, we know everything about the two random variables: both their marginals and their dependence structure.
Two random variables \(X\) and \(Y\) are independent if \[P(Y = y \mid X = x) = P(Y = y) \quad \forall x, y.\]
In other words, knowing \(X\) tells you nothing about \(Y\)!
Remember rule (5)?
\[\begin{split} P(Y = y \text{ and } X = x) &= P(Y = y \mid X = x) P(X = x) \\ &= P(Y = y) P(X = x) \end{split}\]
Example: tossing a coin 2 times. What is the probability of getting \(H\) in the second toss given that we saw a \(T\) in the first one?
Remember rule (5)? \[P(A \text{ and } B) = P(A) P(B \mid A) = P(B) P(A \mid B)\]
This means that \[P(B \mid A) = \frac{P(B)P(A \mid B)}{P(A)}\] This is known as Bayes’ Rule, and it is used in many real-world applications.
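For instance, applied to the portfolio example above: having observed a high return \(S = 4\), how should we update our belief that the economy is expanding? A minimal Python sketch (values from the earlier tables):

```python
# Bayes' rule: P(E = 1 | S = 4) = P(E = 1) * P(S = 4 | E = 1) / P(S = 4).
p_e1 = 0.7             # prior probability of expansion
p_s4_given_e1 = 0.25   # from the conditional table
p_s4_given_e0 = 0.20

p_s4 = p_e1 * p_s4_given_e1 + (1 - p_e1) * p_s4_given_e0  # marginal P(S = 4)
p_e1_given_s4 = p_e1 * p_s4_given_e1 / p_s4
print(round(p_e1_given_s4, 3))  # 0.745: the high return raises our belief in expansion
```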
The mean or expected value is defined as (for a discrete \(X\)): \[E(X) = \sum_{x \in D} x \cdot P(X = x)\]
We weight each possible value by how likely it is. This provides us with a measure of centrality of the distribution, i.e. a “good” prediction for \(X\)!
Question: what is the mean number of heads in two independent coin tosses? Remember that \[p(x) = P(X = x) = \begin{cases} 0.25 & \text{if } x = 0 \\ 0.5 & \text{if } x = 1 \\ 0.25 & \text{if } x = 2 \end{cases}\]
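A one-line check in Python:

```python
# E(X) = sum over x of x * P(X = x), using the pmf above.
pmf = {0: 0.25, 1: 0.5, 2: 0.25}
print(sum(x * p for x, p in pmf.items()))  # 0*0.25 + 1*0.5 + 2*0.25 = 1.0
```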
Example: an investment has three possible revenue outcomes:

| Revenue | \(P(\text{Revenue})\) |
|---|---|
| 250,000$ | 0.7 |
| 0$ | 0.138 |
| 25,000,000$ | 0.162 |
The expected revenue is \[ E[\text{Revenue}] = 0.7 \cdot 250,000\$ + 0.138 \cdot 0\$ + 0.162 \cdot 25,000,000\$ = 4,225,000\$ \]
Should we invest or not?
Investment 1:

| Revenue | \(P(\text{Revenue})\) |
|---|---|
| 250,000$ | 0.7 |
| 0$ | 0.138 |
| 25,000,000$ | 0.162 |

Investment 2:

| Revenue | \(P(\text{Revenue})\) |
|---|---|
| 3,721,428$ | 0.7 |
| 0$ | 0.138 |
| 10,000,000$ | 0.162 |
We add a second investment: the expected revenue is still \(4,225,000\$\). What is the difference?
The variance is defined as (for a discrete \(X\)): \[\text{Var}(X) = \sum_{x \in D} [x - E(X)]^{2} \cdot P(X = x) = \sum_{x \in D} x^{2} \cdot P(X = x) - (E[X])^{2}\]
Weighted average of squared prediction errors. This is a measure of the spread of a distribution. More “risky” distributions have larger variance.
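Returning to the two investments above, a short Python sketch (revenue figures from the tables) shows that the means agree but the spreads do not:

```python
# Compare mean and standard deviation of the two revenue distributions.
def mean_var(dist):
    m = sum(x * p for x, p in dist.items())
    v = sum(x**2 * p for x, p in dist.items()) - m**2
    return m, v

inv1 = {250_000: 0.7, 0: 0.138, 25_000_000: 0.162}
inv2 = {3_721_428: 0.7, 0: 0.138, 10_000_000: 0.162}

for name, dist in (("investment 1", inv1), ("investment 2", inv2)):
    m, v = mean_var(dist)
    print(f"{name}: mean = {m:,.0f}$, std dev = {v**0.5:,.0f}$")
# Same expected revenue (~4,225,000$), but investment 1 is far more spread out.
```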
Question: what is the variance for the number of heads in two independent coin tosses? Remember that \[p(x) = P(X = x) = \begin{cases} 0.25 & \text{if } x = 0 \\ 0.5 & \text{if } x = 1 \\ 0.25 & \text{if } x = 2 \end{cases}\]
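And the same computation for the coin-toss distribution:

```python
# Var(X) = E(X^2) - (E(X))^2 for the number of heads in two tosses.
pmf = {0: 0.25, 1: 0.5, 2: 0.25}
mean = sum(x * p for x, p in pmf.items())             # 1.0
var = sum(x**2 * p for x, p in pmf.items()) - mean**2
print(var)  # 1.5 - 1.0 = 0.5
```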
The normal distribution is the most widely used probability distribution for describing a continuous random variable.
The area under its probability density function (p.d.f.) gives \[P(X \in [a, b]) = \int_{a}^{b} f(x) \ dx\]
When we say “the normal distribution”, we really mean a family of distributions.
We obtain probability densities in the normal family by shifting the bell curve around and spreading it out (or tightening it up).
\(X \sim \text{N}(\mu, \sigma^{2})\): “normal distribution with mean \(\mu\) and variance \(\sigma^{2}\)”.
The standard normal distribution has mean \(0\) and variance \(1\); a standard normal random variable is usually denoted by \(Z\).
If \(Z \sim \text{N}(0,1)\), then \[\begin{split} &P(-1 < Z < 1) \approx 0.68 \\ &P(-1.96 < Z < 1.96) \approx 0.95 \end{split}\]
For simplicity we will often use \(P(-2 < Z < 2) \approx 0.95\).
In general, \(P( \mu - 2 \sigma < X < \mu + 2 \sigma) \approx 0.95\).
Example: below are the pdfs of \(X_{1} \sim \text{N}(0, 1)\), \(X_{2} \sim \text{N}(3, 1)\), and \(X_{3} \sim \text{N}(0, 16)\). Which pdf goes with which \(X\)?
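The original figure is not reproduced here, but a short sketch (matplotlib and scipy assumed) regenerates it:

```python
# Plot the three normal pdfs so they can be matched to X1, X2, X3.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

x = np.linspace(-12, 12, 500)
for mu, var, label in [(0, 1, "N(0, 1)"), (3, 1, "N(3, 1)"), (0, 16, "N(0, 16)")]:
    plt.plot(x, norm.pdf(x, loc=mu, scale=np.sqrt(var)), label=label)
plt.legend()
plt.show()
# Changing mu shifts the bell curve; a larger variance spreads and flattens it.
```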
In \(X \sim \text{N}(\mu, \sigma^{2})\), \(\mu\) is the mean and \(\sigma^{2}\) is the variance.
Standardization: if \(X \sim \text{N}(\mu, \sigma^{2})\), then \[Z = \frac{X - \mu}{\sigma} \sim \text{N}(0, 1)\]
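A quick numerical check of standardization (scipy assumed; the \(\text{N}(3, 4)\) example is made up for illustration):

```python
# Probabilities on the original scale match those after standardizing.
from scipy.stats import norm

mu, sigma = 3.0, 2.0                       # X ~ N(3, 4), so sigma = 2
x = 5.0
print(norm.cdf(x, loc=mu, scale=sigma))    # P(X <= 5) directly
print(norm.cdf((x - mu) / sigma))          # same number via Z = (X - mu)/sigma
print(norm.cdf(1.96) - norm.cdf(-1.96))    # ~0.95, as quoted above
```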
Summary: \(X \sim \text{N}(\mu, \sigma^{2})\):

- \(E[X] = \mu\) and \(\text{Var}(X) = \sigma^{2}\);
- \(P(\mu - 2\sigma < X < \mu + 2\sigma) \approx 0.95\);
- standardizing, \(Z = (X - \mu)/\sigma \sim \text{N}(0, 1)\).
Two random variables can also be jointly normal; this is the bivariate normal distribution: \[(X, Y) \sim \text{N}_{2}(\boldsymbol{\mu}, \Sigma),\] where \(\boldsymbol{\mu}\) is the mean vector and \(\Sigma\) is the covariance matrix.
Probabilities are now double integrals of the joint density: \[P(X \in A, Y \in B) = \int_{A} \int_{B} f(x, y)\ dy\, dx\]
The marginal distributions of a bivariate normal distribution are normal distributions themselves!
How to compute a marginal from the joint?
\[f_{X}(x) = \int_{\mathcal{Y}} f_{(X, Y)}(x, y) \ dy,\] where \(\mathcal{Y}\) is the set of values \(Y\) can take when \(X = x\).
Check out this example.
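Separately, here is a numerical sketch (scipy assumed; the mean vector and covariance matrix are hypothetical) confirming that integrating a bivariate normal joint density over \(y\) recovers the normal marginal of \(X\):

```python
# Marginalize a bivariate normal numerically and compare with the known marginal.
import numpy as np
from scipy.stats import multivariate_normal, norm
from scipy.integrate import quad

mu = np.array([1.0, 2.0])
cov = np.array([[1.0, 0.5],
                [0.5, 2.0]])
joint = multivariate_normal(mean=mu, cov=cov)

x0 = 0.5
f_x0, _ = quad(lambda y: joint.pdf([x0, y]), -np.inf, np.inf)  # integrate over y
print(f_x0)                                                    # numerical marginal at x0
print(norm.pdf(x0, loc=mu[0], scale=np.sqrt(cov[0, 0])))       # exact N(1, 1) pdf at x0
```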
\[\begin{split} &E[X] = \int x \cdot f(x) \ dx \\ &\text{Var}[X] = \int x^{2} \cdot f(x) \ dx - (E[X])^{2} \end{split}\]
You will practice all these calculations in the homework!
Correlation measures the linear relationship between two random variables. The covariance is \[\text{Cov}(X, Y) = \iint x y \cdot f(x, y) \ dx\, dy - E[X] E[Y],\] and the correlation rescales it to lie in \([-1, 1]\): \[\rho = \text{Cor}(X, Y) = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X)\,\text{Var}(Y)}}.\] More on this when we talk about linear regression.
If two variables are independent, then \(\rho = 0\).
The converse is NOT true.
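A classic counterexample: take \(X \sim \text{N}(0, 1)\) and \(Y = X^{2}\). Then \(Y\) is a deterministic function of \(X\), yet their correlation is zero. A simulation sketch (numpy assumed):

```python
# Zero correlation does not imply independence.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
y = x**2                          # completely determined by x

print(np.corrcoef(x, y)[0, 1])    # ~0: no *linear* relationship
# Yet knowing x pins down y exactly, so X and Y are clearly dependent.
```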
Supervised learning: study of how \(Y\) varies with \(X\)!
Until now, we have assumed that we know the probabilities associated with random events.
In practice we do not know them, but we have data. The role of a statistician is to assume a statistical model and estimate its parameters from the data.
One way of doing this is maximum likelihood estimation. More on this in the first homework!
We are not only interested in point estimates of a model's parameters; we also want to quantify the uncertainty around them: confidence intervals!
We assume that the data are independent and identically distributed draws from a normal distribution. By independence, the likelihood factorizes: \[L_{\theta}(x_{1}, \dots, x_{n}) = p(x_{1} \mid \theta) \cdot \dots \cdot p(x_{n} \mid \theta)\]
Steps:

1. Write down the likelihood of the observed data under the model.
2. Maximize it over \(\theta\) to get the maximum likelihood estimate (for the normal mean, \(\hat{\mu} = \bar{x}\)).
3. Use the sampling distribution of the estimate to build a confidence interval, e.g. \(\bar{x} \pm 1.96 \cdot s / \sqrt{n}\) for a \(95\%\) CI.
If we were to repeat the procedure \(100\) times, \(\mu\) would be in the CI approximately 95 times out of 100 (in the case \(\alpha = 5\%\)).
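A simulation sketch of this interpretation (numpy assumed; \(\mu\), \(\sigma\), and the sample size are hypothetical):

```python
# Repeatedly draw normal samples, build a 95% CI each time, count the hits.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 5.0, 2.0, 50, 10_000
hits = 0
for _ in range(reps):
    sample = rng.normal(mu, sigma, size=n)
    xbar = sample.mean()
    se = sample.std(ddof=1) / np.sqrt(n)
    hits += xbar - 1.96 * se < mu < xbar + 1.96 * se
print(hits / reps)  # ~0.95: the true mean is covered about 95% of the time
```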